[configure] Backend Performance Requirements for etcd#148
Walkthrough
A new troubleshooting documentation page was added for etcd backend performance degradation, providing issue characterization, root-cause explanation, and resolution guidance with diagnostic procedures, commands, and monitoring thresholds.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes
🚥 Pre-merge checks: ✅ 5 passed
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/en/solutions/Backend_Performance_Requirements_for_etcd.md`:
- Around line 64-72: Add an explicit safety note to the etcd defragmentation
snippet instructing operators to run etcdctl defrag on one etcd member at a time
(sequentially, not concurrently) using the existing kubectl exec ...
etcd-<node-name> -- etcdctl defrag command; update the paragraph around the
command (referencing the "etcdctl defrag" and "kubectl exec -n kube-system
etcd-<node-name>" text) to state clearly to perform defrag on a single member,
wait for that member to rejoin/settle, then proceed to the next member to avoid
control-plane disruption.
- Around line 13-18: The fenced log block is missing a language tag which
triggers markdownlint MD040; update the code fence that contains the lines
beginning with "etcdserver: failed to send out heartbeat..." and the subsequent
etcdserver/wal lines by adding a language identifier (e.g., "text") after the
opening ``` so the block reads like ```text to satisfy the linter and preserve
formatting.
- Around line 41-45: The fio benchmark currently uses the --fsync=1 flag which
measures fsync (data+metadata) but the documented SLA and WAL behavior require
measuring fdatasync; update the fio invocation in the docs (the command line
containing fio --name=etcd-io-test ...) to replace --fsync=1 with --fdatasync=1
so the 99th-percentile fdatasync latency threshold (<10 ms) is validated
correctly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e512c52c-72ca-4013-a6cf-cbe02c53c390
📒 Files selected for processing (1)
docs/en/solutions/Backend_Performance_Requirements_for_etcd.md
```
etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for xxx ms)
etcdserver: server is likely overloaded
etcdserver: read-only range request "key:\"xxxx\"" count_only:true with result "xxxx" took too long (xxx s) to execute
wal: sync duration of xxxx s, expected less than 1s
```
Add a language tag to the log code fence.
The fenced block at Line 13 is missing a language identifier, which triggers markdownlint MD040.
✅ Suggested patch
-```
+```text
etcdserver: failed to send out heartbeat on time (exceeded the 100ms timeout for xxx ms)
etcdserver: server is likely overloaded
etcdserver: read-only range request "key:\"xxxx\"" count_only:true with result "xxxx" took too long (xxx s) to execute
wal: sync duration of xxxx s, expected less than 1s
🧰 Tools
🪛 markdownlint-cli2 (0.22.0)
[warning] 13-13: Fenced code blocks should have a language specified (MD040, fenced-code-language)
```bash
fio --name=etcd-io-test --ioengine=sync --bs=4k --numjobs=1 --size=512M \
  --rw=write --iodepth=1 --fsync=1 --runtime=30 --time_based
```

The 99th percentile fdatasync latency must be under **10 ms**.
🧩 Analysis chain
🌐 Web query:
In fio, what is the behavioral difference between --fsync=1 and --fdatasync=1, and which one best matches etcd WAL fdatasync latency validation guidance?
💡 Result:
In fio, --fsync=1 issues fsync(2) after every block written, syncing both data and metadata to disk. --fdatasync=1 issues fdatasync(2) after every block, syncing only data (not metadata unless required for data retrieval), falling back to fsync on unsupported platforms. --fdatasync=1 best matches etcd WAL fdatasync latency validation guidance. Etcd primarily uses fdatasync for WAL syncs (fsync only on create/close/repair/cut), and official validation uses fio with --fdatasync=1 (e.g., fio --rw=write --ioengine=sync --fdatasync=1 --bs=2300 --size=XXm), checking 99th percentile fdatasync latency <10ms.
Citations:
- 1: https://www.flamingbytes.com/blog/fio-fsync-end_fsync-fdatasync-and-sync/
- 2: https://fio.readthedocs.io/en/latest/fio_doc.html
- 3: discuss: fdatasync over fsync in WAL etcd-io/etcd#18768
- 4: https://prohoster.info/en/blog/administrirovanie/kak-s-fio-proverit-diski-na-dostatochnuyu-proizvoditelnost-dlya-etcd
- 5: https://docs.okd.io/latest/etcd/etcd-practices.html
- 6: https://eng.d2iq.com/blog/etcd-performance-benchmarking/
- 7: https://etcd.io/docs/v3.4/op-guide/hardware
🏁 Script executed:
fd "Backend_Performance_Requirements_for_etcd.md" --type f
Repository: alauda/knowledge
Length of output: 122
🏁 Script executed:
sed -n '41,45p' docs/en/solutions/Backend_Performance_Requirements_for_etcd.md
Repository: alauda/knowledge
Length of output: 266
Change --fsync=1 to --fdatasync=1 to match the documented threshold.
The fio command uses --fsync=1 (which syncs data and metadata), but the threshold requires fdatasync latency <10ms (which syncs data only). This mismatch means the benchmark measures the wrong primitive. etcd WAL operations primarily use fdatasync, not fsync. Update the command to --fdatasync=1 to correctly validate etcd performance.
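For anyone who wants to script the pass/fail decision, here is a minimal sketch of the threshold check. It assumes the 99th-percentile sync latency has already been extracted as an integer number of nanoseconds. The names `p99_within_threshold` and `THRESHOLD_NS` are illustrative and do not come from the article, and the jq path in the comment is an assumption about fio's JSON output layout that should be verified against the installed fio version.

```bash
#!/bin/sh
# Hypothetical helper: succeeds when a p99 latency (in nanoseconds) is under
# the 10 ms threshold quoted in the article.
THRESHOLD_NS=10000000   # 10 ms expressed in nanoseconds

p99_within_threshold() {
    # $1: 99th-percentile fdatasync latency in nanoseconds (integer)
    [ "$1" -lt "$THRESHOLD_NS" ]
}

# One possible way to obtain the value (assumed JSON layout, verify locally):
#   fio ... --fdatasync=1 --output-format=json \
#     | jq '.jobs[0].sync.lat_ns.percentile."99.000000"'
```

A caller could then write `p99_within_threshold "$p99" || echo "disk too slow for etcd"`, where `$p99` holds the extracted value.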
Suggested patch
 fio --name=etcd-io-test --ioengine=sync --bs=4k --numjobs=1 --size=512M \
-  --rw=write --iodepth=1 --fsync=1 --runtime=30 --time_based
+  --rw=write --iodepth=1 --fdatasync=1 --runtime=30 --time_based
If the database size approaches the quota, perform manual defragmentation:

```bash
kubectl exec -n kube-system etcd-<node-name> -- etcdctl defrag \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
```
Add a defrag safety note (one member at a time).
This runbook should explicitly instruct sequential defragmentation (not all members concurrently) to reduce control-plane disruption risk.
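One way the sequential procedure could be scripted is sketched below. It reuses the article's `kubectl exec ... etcdctl defrag` invocation and certificate paths. The function names `etcdctl_on` and `defrag_sequentially`, the `DRY_RUN` switch, and the node names passed in are hypothetical. Treating a successful `etcdctl endpoint health` as "the member has settled" is also an assumption, not something the article specifies.

```bash
#!/bin/sh
# Sketch: defragment etcd members strictly one at a time.
# With DRY_RUN=1 the commands are only printed, never executed.
CERT_DIR=/etc/kubernetes/pki/etcd

etcdctl_on() {
    # $1: node name; remaining arguments: the etcdctl subcommand to run
    node=$1; shift
    set -- kubectl exec -n kube-system "etcd-$node" -- etcdctl "$@" \
        --endpoints=https://127.0.0.1:2379 \
        --cacert="$CERT_DIR/ca.crt" \
        --cert="$CERT_DIR/server.crt" \
        --key="$CERT_DIR/server.key"
    if [ "${DRY_RUN:-0}" = 1 ]; then echo "$*"; else "$@"; fi
}

defrag_sequentially() {
    # Arguments: node names, processed in the order given
    for member in "$@"; do
        etcdctl_on "$member" defrag || return 1
        # Assumed readiness check: poll until this member reports healthy
        # before moving on to the next one.
        until etcdctl_on "$member" endpoint health; do sleep 5; done
    done
}
```

Running `DRY_RUN=1; defrag_sequentially node-a node-b` prints the commands in order without touching the cluster. The polling loop has no timeout, so a real runbook would likely want to cap the wait.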
✅ Suggested patch
### Database Defragmentation
If the database size approaches the quota, perform manual defragmentation:
+Run defragmentation on **one etcd member at a time** and wait for the member to become healthy before moving to the next member.
```bash
kubectl exec -n kube-system etcd-<node-name> -- etcdctl defrag \
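Adds a new ACP KB article, filed under the `configure` area.
✅ Automated verification passed: 3 / 3 verification steps ran successfully on a real Kubernetes cluster using the article's commands (2026-04-22T13:12:08Z).
Suggested reviewers for the `configure` area
Selected automatically from the active members listed for this area in kb/OWNERS.md and kb/KB_REVIEWERS.md; please ignore this if you were mentioned by mistake. @changluyi @zhangzujian @oilbeater
Contributors without a GitHub handle (please ping manually if relevant to this area):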